EST clustering error evaluation and correction
Abstract
MOTIVATION: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle to such analyses. The structure of EST clustering error, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated.

RESULTS: We identify and quantify two types of error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error rate for 5' EST clustering is approximately 10 times that for 3' EST clustering (30% versus 3%). An over-stringent identity rule, e.g. P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that approximately 80% of the Type I error in 5' EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and so provide more accurate estimates of the true gene cluster profile.
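The two error types defined above can be illustrated with a small sketch. The abstract does not state how the error rates are computed, so the pairwise definition used here is an assumption made for illustration only: the Type I rate is taken as the fraction of same-gene EST pairs placed in different clusters, and the Type II rate as the fraction of different-gene EST pairs placed in the same cluster. The function name, EST identifiers, and toy data are all hypothetical.

```python
from itertools import combinations


def pairwise_error_rates(true_gene, cluster):
    """Return (type1_rate, type2_rate) over all EST pairs.

    Assumed (illustrative) definitions, not taken from the paper:
      Type I  = same-gene pairs split across clusters / all same-gene pairs
      Type II = different-gene pairs merged into one cluster / all different-gene pairs
    """
    same_gene_pairs = split_pairs = 0
    diff_gene_pairs = merged_pairs = 0
    for a, b in combinations(sorted(true_gene), 2):
        same_gene = true_gene[a] == true_gene[b]
        same_cluster = cluster[a] == cluster[b]
        if same_gene:
            same_gene_pairs += 1
            split_pairs += not same_cluster   # sibling ESTs failed to cluster
        else:
            diff_gene_pairs += 1
            merged_pairs += same_cluster      # distinct genes falsely merged
    type1 = split_pairs / same_gene_pairs if same_gene_pairs else 0.0
    type2 = merged_pairs / diff_gene_pairs if diff_gene_pairs else 0.0
    return type1, type2


# Hypothetical toy data: five ESTs from two genes, assigned to three clusters.
true_gene = {"est1": "geneA", "est2": "geneA", "est3": "geneA",
             "est4": "geneB", "est5": "geneB"}
cluster = {"est1": 0, "est2": 0, "est3": 1, "est4": 1, "est5": 2}

type1, type2 = pairwise_error_rates(true_gene, cluster)
print(f"Type I rate: {type1:.3f}, Type II rate: {type2:.3f}")
```

In this toy case the two genes are spread over three clusters, which shows how Type I errors inflate the apparent number of unique genes. Computing such rates requires knowing the true gene of origin for each EST, which the sketch simply takes as given.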
Journal: Bioinformatics
Volume 20, Issue 17
Publication date: 2004